All Roads Lead to Philosophy

Examining the ‘Getting to Philosophy’ Phenomenon on Wikipedia Using Network Analysis

Author

Austin Barish

Published

August 7, 2023

Abstract

In this study, I analyze a phenomenon on Wikipedia in which repeatedly clicking the first link of an article almost invariably takes a user to the Philosophy page. I examine the percentage of Wikipedia pages for which this holds in an effort to understand how Wikipedia’s network is structured and what that structure means for navigability and reader understanding. Previous research indicates that users’ navigation is heavily focused on the lead of a Wikipedia article and rarely ventures beyond the first paragraph (Dimitar Dimitrov 2016); therefore, I limit my analysis to the first several links in this section. Further analysis with greater computing power could cover the links within the entire article. Among these first several links, I seek to determine whether any other link position reaches a specific page, including the philosophy page, with abnormal frequency. To conduct my analysis, I construct a network using Wikipedia pages as nodes and the links on each page as directed edges between nodes. I collected my data using a Breadth-First Search (BFS), meaning that once I reach a page that has already been visited, I move on to another root page. With the network, I examine average path lengths to the philosophy page, the neighbors of the philosophy page that most commonly direct to it, and the structure of the first-link network itself. Furthermore, I examine the second link of Wikipedia pages and conduct an analysis of that network as well. My conclusions demonstrate the effectiveness of Wikipedia’s effort to make its introductory sentences and links sufficiently broad.

Introduction

Wikipedia pages are built with the user’s understanding in mind. To ensure consistency across pages and maintain reliability as a credible source, there are extensive guidelines on the structure of each page. As two of the most important components of a Wikipedia page, linked content and the content of the lead paragraph are tightly monitored. Links serve to “provide instant pathways to locations within and outside the project that can increase readers’ understanding of the topic at hand” (Wikipedia 2023c). Users will click on links when a topic is unfamiliar to them or if they are interested in learning more.

When arriving at a page, a user ought to have the topic explained to them as though they know little to nothing about it. The lead ought to orient the reader so as to “set the scene of the topic” (Wikipedia 2023c). Wikipedia explains the structure of the lead paragraph:

In Wikipedia, the lead section is an introduction to an article and a summary of its most important contents. It is located at the beginning of the article, before the table of contents and the first heading. It is not a news-style lead or “lede” paragraph.

The average Wikipedia visit is a few minutes long. The lead is the first thing most people will read upon arriving at an article, and may be the only portion of the article that they read. It gives the basics in a nutshell and cultivates interest in reading on—though not by teasing the reader or hinting at what follows. It should be written in a clear, accessible style with a neutral point of view.(Wikipedia 2023c)

Wikipedia goes on to outline how the opening paragraph and sentence ought to be structured. It explains that the opening paragraph “should establish the context in which the topic is being considered by supplying the set of circumstances or facts that surround it. If appropriate, it should give the location and time” (Wikipedia 2023c). For example, a building’s first link will most likely be its location. Within that paragraph, the opening sentence is critical for my study, as it will contain the first link. Editors are instructed that “the first sentence should tell the nonspecialist reader what or who the subject is, and often when or where” (Wikipedia 2023c). The guidelines go on to provide explicit instructions on what the first linked topic ought to be in an article:

The first sentence should provide links to the broader or more elementary topics that are important to the article’s topic or place it into the context where it is notable.

For example, an article about a building or location should include a link to the broader geographical area of which it is a part.

Arugam Bay is a bay on the Indian Ocean in the dry zone of Sri Lanka’s southeast coast.

In an article about a technical or jargon term, the first sentence or paragraph should normally contain a link to the field of study that the term comes from.

In heraldry, tinctures are the colours used to emblazon a coat of arms.

The first sentence of an article about a person should link to the page or pages about the topic where the person achieved prominence.

Harvey Lavan “Van” Cliburn Jr. (July 12, 1934 – February 27, 2013) was an American pianist who achieved worldwide recognition in 1958 at age 23, when he won the first quadrennial International Tchaikovsky Piano Competition in Moscow, at the height of the Cold War.

Exactly what provides the context needed to understand a given topic varies greatly from topic to topic.(Wikipedia 2023c)

Because each first link points to a broader or more elementary topic, the pages become increasingly broad as you continue to click the first link. These instructions create a picture of how a topic like philosophy can be at the center of Wikipedia’s first-link network. Conversely, it is doubtful that such a center exists for any other link placement. Even the second link in an article can be more specific, moving laterally or even backwards in specificity rather than towards larger hubs such as philosophy. Take one of Wikipedia’s examples, Harvey Lavan “Van” Cliburn Jr.: his first-link path begins with pianist and then continues as follows: piano, keyboard instrument, musical instrument, music, art, creativity, psychology, mind, thought, consciousness, awareness, philosophy. With each passing link you can sense that your arrival at the philosophy page grows closer; the topics become broader and the connection to philosophy feels increasingly obvious. However, if we were to follow the second link, International Tchaikovsky Piano Competition, we find ourselves on the following path: Saint Petersburg, Russia, Eastern Europe, Ural Mountains, Eurasia, Europe, peninsulas, mainland, continent, regions, Earth’s surface, hemispheres, etc. Unlike the first link, the second link gets stuck in geographic limbo without ever getting closer to a central topic like philosophy. I explore what a second-link network looks like further in my analysis and show that geography, broadly speaking, is the typical destination of pages when clicking the second link.
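The difference between link positions can be made concrete with a small sketch. The snippet below assumes a hypothetical `lead_links` mapping from each page to the ordered list of links in its lead, and derives the first-link and second-link networks from it. The function name, the mapping, and the truncated link lists are illustrative assumptions, not the data structures used in this study.

```python
def nth_link_network(lead_links, n):
    """Map every page to its n-th lead link (1-based), or None if it has fewer than n links."""
    return {
        page: links[n - 1] if len(links) >= n else None
        for page, links in lead_links.items()
    }


# Toy data (assumption): only the Van Cliburn links come from the example
# in the text; the lists are truncated for illustration.
lead_links = {
    "Van Cliburn": ["pianist", "International Tchaikovsky Piano Competition"],
    "pianist": ["piano"],
}

first_links = nth_link_network(lead_links, 1)
second_links = nth_link_network(lead_links, 2)

print(first_links["Van Cliburn"])   # pianist
print(second_links["Van Cliburn"])  # International Tchaikovsky Piano Competition
print(second_links["pianist"])      # None (no second link in this toy data)
```

Keeping one mapping per link position means the same path-following and component analyses can be reused unchanged on the first-link and second-link networks.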

There is special focus on the very beginning of a Wikipedia page because that is where users devote most of their attention. Dimitrov et al. use click data from Wikipedia’s navigation logs to construct a heat map of where users click the most on Wikipedia pages. The heat map shows two clear, dark red, high-density lines at the beginning of the page, directly where the lead is located, demonstrating that users’ highest click rate is on links within the first few lines of the opening paragraph. The rest of the page is sparse apart from a preference for links on the left side of pages, a phenomenon the authors themselves do not fully understand (Dimitar Dimitrov 2016). The high click rate within the lead indicates that the network formed by the first few links in an article is representative of the network that users typically interact with.

Research has already been done into the size of the Giant Connected Component (GCC) of nodes that connect to the philosophy node. According to a study of Wikipedia’s navigability by language, as of 2017, 97.0% of pages in English reach the philosophy page (Daniel Lamprecht 2016), a slight increase of around 2.5% since 2011 (Wikipedia 2023b). These numbers fluctuate across languages, with some languages centered on other pages, such as “Psychology” in Spanish or “Person” in Japanese; the sizes vary, but the majority of nodes still reach these pages (Daniel Lamprecht 2016). My study focuses only on the English network of Wikipedia. In the future, it would be interesting to study this phenomenon in other languages as I have done with English. In particular, previous studies indicate that Dutch has the smallest GCC, with just 67.0% of nodes in its GCC (Daniel Lamprecht 2016), and I would like to compare its network to English to understand this discrepancy. However, the English network is already more than large enough for the scope of this study.
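The share of pages that reach philosophy can be computed without following every path separately. The sketch below assumes a hypothetical `first_link` dictionary mapping each page to its first link (or `None` for a page with no eligible link), reverses those edges, and runs a breadth-first search outward from the target. It uses only the standard library so that it is self-contained; the function name and the tiny toy network are illustrative, and counting the target page itself as "reached" is an assumption.

```python
from collections import defaultdict, deque


def pages_reaching(target, first_link):
    """Return the set of pages whose first-link path arrives at `target`."""
    # Reverse the edges: for each page, record which pages link to it first.
    pointed_from = defaultdict(list)
    for page, nxt in first_link.items():
        if nxt is not None:
            pointed_from[nxt].append(page)

    # Breadth-first search backwards from the target.
    reached = {target}
    queue = deque([target])
    while queue:
        page = queue.popleft()
        for prev in pointed_from[page]:
            if prev not in reached:
                reached.add(prev)
                queue.append(prev)
    return reached


# Toy first-link network built from pages named in the text.
first_link = {
    "consciousness": "awareness",
    "awareness": "philosophy",
    "philosophy": "existence",
    "existence": "entity",
    "entity": "existence",
    "2011-12 Exeter City F.C. season": None,  # no link outside tables
}

reached = pages_reaching("philosophy", first_link)
share = len(reached) / len(first_link)
print(f"{share:.1%}")  # 50.0%
```

Because every page has at most one first link, a single backwards traversal visits each edge once, which scales far better than walking a separate path from every page.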

If you would like to see how this network is formed beyond clicking through Wikipedia pages on your own, the online tool xefer will quickly build out a network of pages and their first links until you reach the philosophy page. It is a helpful way to visualize what this looks like in practice. However, it was designed to always reach the philosophy page, even from pages whose first-link path avoids it. It does this by skipping to the second link on a page when it realizes it will not be able to reach the philosophy page through the first link (xefer 2011). Therefore, we need to construct our own network to understand these disconnected nodes.

To understand how a node can be disconnected, we ought to look at what makes philosophy the center of the network. If you click on the first link on the philosophy page, you go to the existence page, which takes you to the entity page, then right back to the existence page, forming a loop that makes up the “bottom” of this network. These pages are not nearly as central as philosophy; otherwise the phenomenon would be about one of them instead. For another node to avoid the philosophy node, its path must end in a similar cycle or on a page that lacks any link. A page in such a cycle is therefore going to be a broad topic, as it has to be something that could similarly appear in the first sentence of a Wikipedia page. This eliminates hyper-specific pages from consideration, despite them being the intuitive guess for what might manage to avoid philosophy. However, these specific pages can eventually lead to the broad pages that manage to cycle without hitting the philosophy page. Furthermore, there can also be pages with no links that function as dead-end pages. Wikipedia recently underwent an effort to remove all true dead-end pages (pages with zero links) (Wikipedia 2023a). Despite these efforts, there remain pages with no links as far as this study is concerned. For example, many sports pages have a lot of links, but they all lie within tables, which are not included in this phenomenon. On 2011-12 Exeter City F.C. season, for instance, there are many links but none in the body text of the page. All of them are in tables or citations, meaning that this is a dead-end page for the philosophy phenomenon. Additionally, this study does not consider links in lists, a choice explained in greater detail in the methods section.
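The three ways a first-link path can end (arriving at philosophy, falling into a cycle, or stopping at a dead end) can be sketched as a small classifier. This is a minimal illustration under stated assumptions: `first_link` is a hypothetical dictionary from each page to its first eligible link (or `None`), the function name and outcome labels are invented for this example, and the toy edges are taken from the pages named in the text.

```python
def classify_path(start, first_link, target="philosophy"):
    """Follow first links from `start` and report how the path ends.

    Returns a tuple (outcome, path), where outcome is one of
    "reaches target", "cycle", or "dead end".
    """
    path = [start]
    seen = {start}
    page = start
    while page != target:
        nxt = first_link.get(page)
        if nxt is None:
            return "dead end", path   # no eligible link on this page
        if nxt in seen:
            return "cycle", path      # the path loops back on itself
        path.append(nxt)
        seen.add(nxt)
        page = nxt
    return "reaches target", path


# Toy first-link network (assumption: a tiny excerpt, not the crawled data).
first_link = {
    "mind": "thought",
    "thought": "consciousness",
    "consciousness": "awareness",
    "awareness": "philosophy",
    "philosophy": "existence",
    "existence": "entity",
    "entity": "existence",
    "2011-12 Exeter City F.C. season": None,
}

print(classify_path("mind", first_link))
print(classify_path("existence", first_link))
print(classify_path("2011-12 Exeter City F.C. season", first_link))
```

Stopping as soon as the target is reached matters here: the philosophy page itself leads into the existence and entity loop, so a walker that kept going would report a cycle for every page.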

Among links in the lead, a page’s neighbors remain semantically related to that page. A study that constructed Wikipedia’s network using the first ten links in an article as a node’s edges determined that the nodes form communities of semantically related terms (Neven Matas 2015). The mathematics page, for example, will be in a community of other topics related to math, such as physics. For our purposes, this is an important result, as it helps to paint a picture of what the branches stemming from philosophy’s neighbors will look like. For example, we can now expect scientific terms to be connected in communities, allowing them to pass through the science page on their way to the philosophy page.

Beyond some of the quicker results, such as the size of the GCC, the average path length to philosophy, the number of disconnected components, and the nature of networks from other link locations, I also look at the neighbors of the philosophy node. I also investigate the size of the GCC if the philosophy node is removed and the sizes of the other remaining large components. I hypothesized, and found, that the awareness node and its connecting parts form the basis of the GCC and that the component does not shrink by more than 10%. However, if awareness is removed as well, the GCC shrinks dramatically, as the awareness node serves as a bridge between all scientific topics and all location-based topics (buildings, monuments, historical figures).
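The node-removal experiment can be illustrated on a toy network. The sketch below treats components as weakly connected (edge direction ignored), which is an assumption about how components are counted; the function name is hypothetical, and the edges feeding into awareness are invented placeholders for a "science" branch and a "location" branch rather than real Wikipedia links.

```python
from collections import defaultdict


def largest_component_size(first_link, removed=()):
    """Size of the largest weakly connected component after deleting `removed` pages."""
    removed = set(removed)
    neighbors = defaultdict(set)
    nodes = set()
    for page, nxt in first_link.items():
        if page in removed:
            continue
        nodes.add(page)
        if nxt is not None and nxt not in removed:
            nodes.add(nxt)
            # Store the edge in both directions so direction is ignored.
            neighbors[page].add(nxt)
            neighbors[nxt].add(page)

    best = 0
    unvisited = set(nodes)
    while unvisited:
        # Depth-first flood fill from an arbitrary unvisited page.
        stack = [unvisited.pop()]
        size = 0
        while stack:
            page = stack.pop()
            size += 1
            for other in neighbors[page]:
                if other in unvisited:
                    unvisited.remove(other)
                    stack.append(other)
        best = max(best, size)
    return best


# Toy network (assumption): hypothetical science and location branches
# that both pass through awareness on their way to philosophy.
first_link = {
    "science_page": "science_hub",
    "science_hub": "awareness",
    "building_page": "location_hub",
    "location_hub": "awareness",
    "awareness": "philosophy",
    "philosophy": "existence",
    "existence": "entity",
    "entity": "existence",
}

print(largest_component_size(first_link))                                       # 8
print(largest_component_size(first_link, removed={"philosophy"}))               # 5
print(largest_component_size(first_link, removed={"philosophy", "awareness"}))  # 2
```

In this toy case, removing philosophy only detaches the existence and entity loop, while removing awareness as well splits the remaining branches apart, mirroring the bridging role described above.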

Methods

All of my analysis and data collection were done using Python 3.10.12.

Plotting Methods

To create the plots needed for my analysis, I used Matplotlib, Seaborn, and NetworkX’s drawing tools.

Results

Conclusions

References

Daniel Lamprecht, Markus Strohmaier. 2016. “Evaluating and Improving Navigability of Wikipedia: A Comparative Study of Eight Language Editions.” OpenSym ’16: Proceedings of the 12th International Symposium on Open Collaboration.
Dimitar Dimitrov, Markus Strohmaier. 2016. “Visual Positions of Links and Clicks on Wikipedia.” WWW ’16 Companion: Proceedings of the 25th International Conference Companion on World Wide Web 2.
Neven Matas, Ana Meštrović. 2015. “Extracting Domain Knowledge by Complex Networks Analysis of Wikipedia Entries.” 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) 3.
Wikipedia. 2023a. “Wikipedia:Dead-end pages.” 2023. https://en.wikipedia.org/wiki/Wikipedia:Dead-end_pages.
———. 2023b. “Wikipedia:Getting to Philosophy.” 2023. https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy.
———. 2023c. “Wikipedia:Manual of Style.” 2023. https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style.
xefer. 2011. “All Roads Lead to ‘Philosophy’.” 2011. https://www.xefer.com/2011/05/wikipedia.